Brain Stroke Dataset - Analysis, Part II

Author: Jakub Bednarz

Previous parts: Part I.

Introduction

In this report, we shall continue the analysis of the brain stroke dataset by computing variable attributions for the predictors, based on the predictive models we fit previously, to see which predictors are the most important. To this end, we will use the shap and dalex libraries to compute SHapley Additive exPlanations (SHAP values).

The agenda is as follows:

  1. First, we will look at what Shapley values are, and how they can be used to formalize the notion of variable importance.
  2. We will decompose the predictions of a tree-based ensemble model on a number of samples from a dataset. In particular, we will show how different observations may differ in the predictors which drive the final judgement of the model, and how the same predictor may have a radically different effect on the final prediction.
  3. We will check whether different software packages compute the same SHAP values. If they diverge, it is important to understand the causes of the divergence, such as approximations and simplifications made by different tools, in order to properly interpret their results.
  4. We will compare these attributions with those computed for a model of a different class, in this case one based on logistic regression, to better understand how different types of models arrive at their results.

Shapley values

Shapley values comprise a game-theoretical concept for assigning importance to individual players in cooperative games. A cooperative game consists of a set of players $P$ and a "payoff function" $v: 2^P \to \mathbb{R}$ which assigns a payoff to each coalition, a coalition being a subset of players. One measure of the significance of a single player $p \in P$ is the surplus $v(S \cup \{p\}) - v(S)$ in the payoff when $p$ joins a coalition $S$ with $p \not\in S$. Shapley values represent the average of this surplus across all coalitions and across all ways in which a particular coalition may arise when players form it one by one. For more detailed information, including the formulae for computing Shapley values, one may read the Wikipedia page.

Example

To gain some further intuition about the Shapley values, let's compute them for a simple example of a three-player coalitional game. Let the payoff function $v$ for this game be: $$ \begin{align*} v() &= 0\\ v(A) &= 20\\ v(B) &= 20\\ v(C) &= 60\\ v(A, B) &= 60\\ v(A, C) &= 70\\ v(B, C) &= 70\\ v(A, B, C) &= 100 \end{align*} $$

Then, we can compute the Shapley value $\phi_A(v)$ in the following fashion: $$ \begin{align*} \phi_A(v) &= \frac{1}{\#\text{ orders of players}} \sum_{\text{orders of players}} \bigl(v(\text{$A$ and players preceding $A$}) - v(\text{players preceding $A$})\bigr)\\ &= \frac{1}{6}\bigl(2[v(A) - v()] + [v(B, A) - v(B)] + [v(B, C, A) - v(B, C)]\\ &\quad + [v(C, A) - v(C)] + [v(C, B, A) - v(C, B)]\bigr)\\ &= 25 \end{align*}$$ which we can interpret as $A$ bringing, on average, a $25$ surplus to a randomly formed coalition.
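This averaging over join orders is easy to brute-force for a three-player game. The sketch below encodes the payoff function above and recovers the Shapley values directly from the definition:

```python
from itertools import permutations

# Payoff function of the example game, keyed by the coalition (a frozenset).
v = {
    frozenset(): 0,
    frozenset("A"): 20,
    frozenset("B"): 20,
    frozenset("C"): 60,
    frozenset("AB"): 60,
    frozenset("AC"): 70,
    frozenset("BC"): 70,
    frozenset("ABC"): 100,
}

def shapley_values(players, v):
    """Average each player's marginal contribution over all join orders."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            # Surplus this player brings to the coalition formed so far.
            phi[p] += v[coalition | {p}] - v[coalition]
            coalition = coalition | {p}
    return {p: total / len(orders) for p, total in phi.items()}

phi = shapley_values("ABC", v)
print(phi)  # phi_A = 25.0, phi_B = 25.0, phi_C = 50.0; they sum to v(A, B, C) = 100
```

Note that the three values sum to the grand-coalition payoff $v(A, B, C) = 100$, which is exactly the efficiency property of Shapley values.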

SHAP values - An application of Shapley values in XAI

Shapley values provide an attractive framework for dealing with feature importances. If we view a predictive model as a cooperative game between the predictor variables, with the "payoff" being the model's prediction, then the $\phi$'s satisfy many desirable properties, for example $\phi_x = 0$ signifying that feature $x$ is irrelevant to the prediction. However, due to the combinatorial definition of the Shapley values, they can be prohibitively expensive to compute exactly, which gives rise to various model-type-specific approximations and fast exact algorithms, such as TreeSHAP for tree ensembles.

Variable attributions for brain stroke dataset

Having gained an understanding of what SHAP values are, let's now see how they can be used for explaining the predictions of our models for predicting brain stroke. We will use two libraries: shap and dalex.

For now, though, we will use the shap library. It has a number of plotting options for visualizing the attributions of the variables. To get a high-level overview, we can use the summary plot:

Here, we can see how different predictor variables affect the output variable. For example, if we look at the hypertension variable, we can see that when its value is high, as indicated by the red color, its SHAP value is also likely to be high, increasing the predicted chance of stroke. Likewise, a low value of age (blue color) is correlated with the chance of a brain stroke being lower than average. Features such as gender_Male or Residence_type_Urban do not seem to have any effect on the likelihood.

We can investigate the impact of a single variable with a dependency plot. Let us plot it for age.

Here we can clearly see that, as the age increases, so does the SHAP value. For example, one conclusion we can draw is that up until the age of around 50 the chance of getting a brain stroke is lower by around 10-20 percentage points (compared to the baseline of 30% for our model), whereas from about the age of 70 the chance is higher by about 30 percentage points.

Analyzing observations

Another way to analyze the results in greater detail is to look at the variable attributions for a single observation. Let's therefore pick a random observation:

1778
gender_Male 0.00
ever_married_Yes 1.00
work_type_Govt_job 0.00
work_type_Private 1.00
work_type_Self-employed 0.00
work_type_children 0.00
Residence_type_Urban 0.00
smoking_status_Unknown 0.00
smoking_status_formerly smoked 0.00
smoking_status_never smoked 1.00
smoking_status_smokes 0.00
age 36.00
hypertension 0.00
heart_disease 0.00
avg_glucose_level 100.33
bmi 23.20

We will take a look at the results of our model for a married 36-year-old woman with a fairly normal BMI index of 23.2 - we might expect the probability to be very low.

And it is indeed quite low, around 8%. From this diagram we can derive a number of other useful observations:

  • the most significant factor behind the result is age being low, which reduces the chance by 17 percentage points;
  • the other factors, by themselves, don't seem to affect the result all that much;
  • the only indicator that increases the chance of stroke is her being married, interestingly enough. In fact, let's look at the plot for the marital status:

Not being married seems to consistently reduce the chance by a fair bit, whereas being married increases it slightly. (As for why that might be the case, I don't know, but I thought it interesting.)

Features with different attributions

As we have seen from the plots, variables might have positive or negative attributions, depending on their values. Let's, for example, take a look at two records:

4379 2473
gender_Male 1.00 1.00
ever_married_Yes 0.00 1.00
work_type_Govt_job 0.00 0.00
work_type_Private 0.00 1.00
work_type_Self-employed 1.00 0.00
work_type_children 0.00 0.00
Residence_type_Urban 0.00 1.00
smoking_status_Unknown 1.00 0.00
smoking_status_formerly smoked 0.00 1.00
smoking_status_never smoked 0.00 0.00
smoking_status_smokes 0.00 0.00
age 31.00 66.00
hypertension 0.00 1.00
heart_disease 0.00 0.00
avg_glucose_level 64.85 82.91
bmi 23.00 28.90

The variable attributions for them are:

And indeed, in the first case age, hypertension and bmi, among others, have negative attributions, their values indicating a reduced chance of a stroke, whereas in the second case all of these variables point to an increased possibility of a stroke.

Observations having different variables of highest impact

Another phenomenon we can observe is different observations having different variables which best explain the predicted outcome. For example, for:

2764 1071
gender_Male 0.00 0.00
ever_married_Yes 1.00 1.00
work_type_Govt_job 0.00 0.00
work_type_Private 0.00 1.00
work_type_Self-employed 1.00 0.00
work_type_children 0.00 0.00
Residence_type_Urban 1.00 1.00
smoking_status_Unknown 0.00 0.00
smoking_status_formerly smoked 1.00 1.00
smoking_status_never smoked 0.00 0.00
smoking_status_smokes 0.00 0.00
age 82.00 49.00
hypertension 1.00 0.00
heart_disease 0.00 0.00
avg_glucose_level 107.21 67.55
bmi 27.00 17.60

we get:

Or, in other words, for the first subject the variables of greatest import in explaining the result are age and hypertension, whereas for the other, bmi and ever_married_Yes play a greater role. Notably, for the second person age has a fairly small impact on the prediction, even though in almost all previously studied cases we've seen it at or near the top of the list.

Comparing two predictive models with SHAP

SHAP values grant us an understanding of how a predictive model arrives at its conclusions. Naturally, we can also use them to compare different types of models, which may help us, for example, decide which one we ought to use, beyond such standard criteria as accuracy, precision or recall. To be specific, we will now look at the predictions and SHAP values for a logistic regression model.
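One practical wrinkle is that a logistic regression explanation lives in log-odds space rather than probability space: the explainer's base value plus an observation's SHAP values give the model's log-odds output, which the sigmoid maps back to a probability. A minimal sketch, using hypothetical numbers for the base value and per-feature attributions:

```python
import numpy as np

def sigmoid(z):
    """Map log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical log-odds decomposition of one observation.
base_log_odds = -1.2                    # explainer's expected value (log odds)
shap_row = np.array([0.8, -0.3, 0.1])   # per-feature attributions (log odds)

log_odds = base_log_odds + shap_row.sum()
probability = sigmoid(log_odds)
print(round(probability, 3))  # 0.354
```

This conversion only applies to the total: the individual per-feature SHAP values stay in log-odds units and cannot each be read as percentage-point changes.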

One important thing to note is the difference in units between the tree model and the logistic regression model: the latter outputs log odds, which we can convert to a raw probability with the sigmoid function. Even accounting for that, it's pretty clear that the behavior of the model is wildly different, at least for these observations. For example, the SHAP values for the logistic regression seem to balance out the dummy variables (like smoking_status_formerly smoked) which we introduced to make the categorical data intelligible to the model. Another thing is that it does not put as much weight on the BMI value as the tree model does, as evidenced by bmi not showing up on either of the plots, even though we would expect a predictive model to take it into account. Let's in fact look at the summary plot for the model:

It would seem, glancing at it, that the dummy variables get assigned a SHAP value depending on the variable value alone, and variables like age have their impact more "spread out". Personally, I found the results for the tree model more "natural" and interpretable, which could be a factor in choosing it over the linear model. As we might recall from Part I, the AUROC scores for these two models were very similar, so if we didn't take into account how these models arrive at their results, we could make a choice that introduces anomalous results.

Comparing different packages for computing and visualizing SHAP values

To finish off the analysis, we will now compare packages. So far we have focused on the shap package; we will now compare it with dalex to briefly check whether the results are the same, and also to see how one might approach visualizing the results differently.

I would say that (1) the attributions seem broadly similar, although I will note that for the second observation being a former smoker was given a lesser impact in dalex than in shap, among other differences; (2) the plots (well, this particular type of plot) are fairly visually similar. I suppose now would be the time and the place to recommend one of them, but I haven't explored them to a degree which would allow me to make a learned judgement.